Explore Prosper Lending Data by Karthik

Univariate Plots Section

##  [1] "Cancelled"              "Chargedoff"            
##  [3] "Completed"              "Current"               
##  [5] "Defaulted"              "FinalPaymentInProgress"
##  [7] "Past Due (>120 days)"   "Past Due (1-15 days)"  
##  [9] "Past Due (16-30 days)"  "Past Due (31-60 days)" 
## [11] "Past Due (61-90 days)"  "Past Due (91-120 days)"

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

50% of loans, 56576 are Current. Interested in knowing how many are “Active” and how many are “Closed”.

~51.6% of loans are active, meaning they are not Charged Off, Cancelled, Defaulted, and Completed.

For a typical closed loan, how long from Origination to Closed?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -3.79    0.83    1.44    1.62    2.59    4.27   58848

##                     ListingKey           ListingCreationDate
## 108298 DEAA359893047281162F432 2013-12-27 12:02:50.107000000
## 112976 E77E34330098056070F5C30 2008-10-07 17:09:34.653000000
##                 ClosedDate LoanOriginationDate     AgeYears
## 108298 2010-03-16 00:00:00 2013-12-31 00:00:00 -3.793664384
## 112976 2008-10-16 00:00:00 2008-10-17 00:00:00 -0.002739726

Average age in years from Origination to Closed is 1.62, Median Age is 1.44. We can see from the plot that the distribution is multi-modal and slightly left skewed. Quick note: 2 loans have ages below 0. This is not an issue with the age calculation. Indeed, in these case you can see that the LoanOriginationDate > ClosedDate. This may be an instance of some messy data We ignore these 2 loans and re-summarize this variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8272  1.4420  1.6230  2.5880  4.2690

It would be interesting to know how these ages vary for different loan statuses (for example, how quickly are loans cancelled?!), and different loan terms. In fact, let’s examine the loan counts by the term.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00
## 
##         12         36         60 
## 0.01416572 0.77040821 0.21542607

Another interesting set of dates to compare is ListingCreationDate and LoanOriginationDate. This captures the time it takes from a loan to be listed to receiving the loan from prosper.

## [1] "Average Days"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.405    4.594    8.175   12.520   12.600 1094.000

##       20%       40%       60%       80%       90%       95% 
##  4.044384  6.610001  9.591654 14.103341 20.250690 37.696425

90% of loans are originated within 21 days of listing. The median borrower is waiting about 1 week and 1 day to get their funds. Kind of confused about how cancellations mix with this. Seems like the loans in this data have all been originated. So, we would not have data about loans that were cancelled before origination (all the loans in this dataset have been originated). Therefore, it’s hard to say if the wait time to origination is correlated with pre-origination cancellations – this is obviously data that Prosper would find to be too sensitive to share. It would be interesting to see the relationship between the duration from listing to origination, and the likelihood of cancellation (after origination).

Moving on, let’s see what are the major listing categories for borrowers who’s loans were originated.

## 
##       Not Available  Debt Consolidation    Home Improvement 
##               16965               58308                7433 
##            Business       Personal Loan         Student Use 
##                7189                2395                 756 
##                Auto               Other       Baby&Adoption 
##                2572               10494                 199 
##                Boat Cosmetic Procedures     Engagement Ring 
##                  85                  91                 217 
##         Green Loans  Household Expenses     Large Purchases 
##                  59                1996                 876 
##      Medical/Dental          Motorcycle                  RV 
##                1522                 304                  52 
##               Taxes            Vacation       Wedding Loans 
##                 885                 768                 771
## 
##       Not Available  Debt Consolidation    Home Improvement 
##        0.1488980753        0.5117564970        0.0652378069 
##            Business       Personal Loan         Student Use 
##        0.0630962725        0.0210203885        0.0066352458 
##                Auto               Other       Baby&Adoption 
##        0.0225738785        0.0921035309        0.0017465792 
##                Boat Cosmetic Procedures     Engagement Ring 
##        0.0007460263        0.0007986870        0.0019045613 
##         Green Loans  Household Expenses     Large Purchases 
##        0.0005178300        0.0175184532        0.0076884594 
##      Medical/Dental          Motorcycle                  RV 
##        0.0133582594        0.0026681412        0.0004563926 
##               Taxes            Vacation       Wedding Loans 
##        0.0077674504        0.0067405672        0.0067668975

For nearly 24% of loans, a listing category is Not Available (15%) or Other (9%). Top three categories besides these are:

I am surpised that Home Improvement is so high on this list - this would seem more like a luxury loan. But perhaps the borrowers that apply for this loan show credit/income conducive to being able to pay this off easily. Finally, it’s pretty interesting to see categories like Taxes/Vacation/Wedding Loans/Engagement Ring (although they only account for <4% of originated loans)

Let’s see a distribution of the loan sizes:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The minimum loan size is $1,000 and the maximum loan origination amount is $35,000. Seems like the most popular loan sizes in descending order are: - $5,000 - $10,000/$15,000 - $20,000/$25,000

75% of loans are less than $12,000. It would be interesting to see the mean/median loan size for each listing category. For example, I would expect the mean/median business loans to be larger than the mean/median debt consolidation.

Let’s see where our borrowers are usually from.

It’s easy to see that California is the home state for a majority of borrowers. - 5% (5500) of loans have no BorrowerState listed - Top 4 states: California, then Texas/New York/Florida. - Top 4 states account for 31% of all loans

Let’s see some more borrower information: Occupation, EmploymentStatus, EmploymentStatusDuration, IsBorrowerHomeowner

40% of borrowers are Other (25%) or Professional (12%) or Blank (3%). I would expect a little more clarity in this data, as this could impact willingness of investors to fund your loan. After these occupations, the top five are (each between 3% and 4% of total loans): - Computer Programmer - Executive - Teacher - Administrative Assisstant - Analyst

It’s interesting that Prosper has 7 different categories for Students: [Type | Percentage of total loans]

College Juniors/Seniors/Graduate students to be the highest count of loan originations. Freshman/Sophomore students, community college and technical school students have a smaller percentage of the loans. With the current data, we cannot say if this is because of the number of applications, or Prospers loan acceptance process. However, we can check later about which occupations have the best/worst default rates.

## 
##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134
## 
##                    Employed     Full-time Not available  Not employed 
##   0.019791639   0.590870393   0.231312041   0.046929443   0.007328611 
##         Other     Part-time       Retired Self-employed 
##   0.033404425   0.009549137   0.006977540   0.053836769
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

## 
##     False      True 
## 0.4955282 0.5044718

Most borrowers are Employed (60%) or Full-Time (23%). <1% for each of the following Not employed/Part-Time/Retired. This is consistent with what I expect.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

## 
##     False      True 
## 0.4955282 0.5044718

Borrowers for >7000 loans have been employed for less than 6 months in their current job. So, although it seems many loans don’t go to students, borrowers that are relatively new (<6 months) to their job constitute most of the loans. This is a right skewed distribution.

## 
##     False      True 
## 0.4955282 0.5044718

49.5% of borrowers are homeowners. I would guess that disproportionately large percentage of borrowers doing home improvement are home owners (can check this later)!

let’s take a look at a borrowers credit score and utilization:

## [1] 0.06456199
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604

Most originated loans go to borrowers with credit scores between 640 and 740. 660 - 700 range constitutes about 28% of loans. 6% of loans are to borrowers with credit below 600 or no credit rating at all.

50% of borrowers have a utilization between .3 and .84. Median BankcardUtilization is around .6. That means, at the time the credit profile was pulled, around 60% of available credit was being used by the median borrower. 7% (7604) of loans had a value of NA for BankcardUtilization. A possible explanation for this is that 7% of originated loans are for borrowers with no credit card at the time of applying for the loan. It’s a very interesting distribution, as there are peaks around 0, and 1, and a long tail after 1 (for people who’s utilization is more than their available credit).

I wonder how much the typical prosper borrower earns, and what percentage are income verifiable?

## 
##      False       True 
## 0.07608591 0.92391409

Most borrowers (56%) in the $25k-$75k annual income. And, 92% of borrowers can prove it (92% are income verifiable). If i’m an investor, its reassuring that the borrower can validate their income source! For those borrowers who’s income is Not Displayed (7%)/Not Employed (<1%), I’m surprised they were approved for a loan and that they got funded.

Speaking of which, since all of these loans are originated, that means they all got full funded, right?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7000  1.0000  1.0000  0.9986  1.0000  1.0120
## [1] 0.0007460263

So, every originated loan is at least 70% funded. The top 75% of loans are fully funded. .07% of loans were not fully funded. So, maybe the platform covers the rest? Not sure how this works.

Back to borrower features: what’s the typical DebtToIncome ratio for a borrower?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Heavily right skewed distribution with a value of 10.01 representing a ratio >= 1000%.

We may try a log10 transformation of this feature. Since this feature can = 0, we will do a log10 transform on the feature + 1.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

We can see the full data on this scale - most peopl fall have a DebtToIncomeRatio between .1 and 1 Median/Mean is .22/.276 and 50% of borrowers are between .14 and .32.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4743  0.0000  0.0000  0.2187  0.4823  1.0000

## [1] 69

Mean unweighted loss for closed loans is around 22%. Some times (69 loans, ~.06% of loans), the loss is negative. This occurs when LP_NetPrincipalLoss <0, which occurs mostly for loans from 2007/2008/2009,and may indicate overrepayment from the borrower.

There are many more variables to explore – we can consider some of them in the Bivariate/Multivariate sections

Univariate Analysis

What is the structure of your dataset?

Dataset has 113,937 observations of 82 variables. Each observation is of a loan on the prosper platform originated between late 2005 and early 2014. There are many types variables in this dataset: Datetime, Categorical (ordered, un-ordered), Binary, and Numerical.

Datetime: ListingCreationDate, LoanOriginationDate, ClosedDate

Categorical: ProsperRating(Alpha) - ordered categorical variable, LoanStatus/ListingCategory/BorrowerState/Occupation/EmploymentStatus/…

Binary: IsBorrowerHomeowner, IncomeVerifiable

Numerical: LoanOriginationAmount/BankcardUtilication/RevolvingCreditBalance

What is/are the main feature(s) of interest in your dataset?

My main feature of interest is the Loss Rate; LP_NetPrincipalLoss/LoanOriginalAmount. This feature is of interest in terms of understanding how well Prosper is doing in terms of minimizing chargedoff/defaulted loans and how much loss prosper has suffered from these outcomes.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think we would need to consider at least the following features.

About the Loan:

  • Term
  • BorrowerAPR
  • LoanOriginationAmount

About the Borrower:

  • Occupation
  • EmploymentStatus
  • EmploymentStatusDuration
  • IsBorrowerHomeowner
  • Recommendations
  • IncomeRange
  • IncomeVerifiable
  • CreditScoreRange
  • TotalProsperLoans
  • DebtToIncomeRatio

About the lending:

  • InvestmentFromFriendsCount
  • InvestmentFromFriendsAmount
  • Investors

There are probably more features in this dataset that are relevant, but we may consider just a subset of these for simplicity.

Did you create any new variables from existing variables in the dataset?

Yes:

  1. isActive - is the loan “Closed” or currently open.
  2. AgeYears - time (in years) from LoanOriginationDate to ClosedDate. NA if loan is still open.
  3. AgeDaysToOrigination - time (in days) from ListingCreationDate to LoanOriginationDate
  4. BorrowerStateFull - full name of state (in order to project the count of loans by state onto the US map)
  5. CreditScoreRange - concat the CreditScoreLower and CreditScoreUpper to create one variable that captures the the CreditScoreRange of the borrower
  6. Loss Rate - LP_NetPrincipalLoss/LoanOriginalAmount - get how much of a loan has been lost on a per loan basis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes:

  • BankcardUtilization’s density plot shows a bimodal distribution and a long tail. I did not change this data.
  • I did a Log10 transform to the DebtToIncomeRatio distribution because it was heavily right skewed by a few individuals that have >1000% ratio.

Bivariate Plots Section

Let’s start out with the 2 features of interest – Prosper’s Score and the LoanStatus.

Okay, the bottom right plot shows that there’s a bunch of loans with no Proser Rating. So, prosper introduced this score July 2009. Therefore, all loans before this time have no “prosper rating”. We may how many loans have no prosper rating, and also see the CreditGrade for these pre-july 2009 loans.

## 
##    AA     A     B     C     D     E    HR    NC       
##  3509  3315  4389  5649  5153  3289  3508   141 84984
## Source: local data frame [9 x 4]
## 
##   CreditGrade lower upper count
##        (fctr) (int) (int) (int)
## 1          AA   760   899  3509
## 2           A   720   759  3315
## 3           B   680   719  4389
## 4           C   640   679  5649
## 5           D   600   639  5153
## 6           E   540   599  3289
## 7          HR     0   559  3508
## 8          NC     0    19   141
## 9               600   899 84984

two interesting things to note here:

So, for borrowers with a credit score range (not NA_NA), you can see a clear delineation between CreditGrades based on the CreditScore. This correlation definitely gives a lot of evidence to indicate that up until July 2009, Prosper graded loan applications almost exclusively based on the CreditScore range.

Let’s go a step further, how did the previous CreditGrade system fair in terms of defaults?

## Source: local data frame [8 x 5]
## 
##   CreditGrade total cancelled completed chargeoff_or_default
##        (fctr) (int)     (dbl)     (dbl)                (dbl)
## 1          AA  3509         0      0.85                 0.15
## 2           A  3315         0      0.76                 0.24
## 3           B  4389         0      0.68                 0.32
## 4           C  5649         0      0.64                 0.36
## 5           D  5153         0      0.61                 0.39
## 6           E  3289         0      0.51                 0.49
## 7          HR  3508         0      0.39                 0.61
## 8          NC   141         0      0.30                 0.70

The distribution of completions/defaults across CreditGrade look consistent with what an investor would expect. Nevertheless, having a AA loans default around 12-17% of the time is not ideal. Perhaps this number was lowered after the introduction of the Prosper Rating.

Okay, so how has the new Prosper Rating done in terms of mitigating defaults for different grades? (We will only consider closed loans). Predicting whether an open loan will be completed/defaulted is a whole other problem itself, and we don’t have the right data for it.

Defaults have decreased across all categories of ratings. So, i think the new rating system is performing better than the old one. Perhaps Prosper improved their underwriting process with data collected from their historical loans.

Default rate is just part of it. If 1% of loans default, but those 1% are in the 99th percentile in terms of loan originations, then prosper could still have lost as much money as in the case when a larger percentage of small to midsized loans default. So, let’s take a look at the LP_NetPrincipalLoss per loan for different grades. We do “per loan” to account for the fact that are many more closed loans pre july 2009, so we want the per loan loss, rather than the absolute loss.

## Source: local data frame [2 x 6]
## 
##   isPreJuly2009Loan total total_net_loss total_originated loss_per_loan
##               (lgl) (int)          (dbl)            (int)         (dbl)
## 1             FALSE 26005       31202559        165801431      1199.868
## 2              TRUE 29084       46436448        179137169      1596.632
## Variables not shown: loss_rate (dbl)

So, it’s clear that, for all closed loans, between Pre/Post July 2009, there’s an improvement of:

Let’s continue with this loss_per_loan and loss_rate metric, and see how it varies across different loan types. Which types are more unreliable than others?

Loss Rate Below 10%: RV, Motorcycle, Engagement Ring, Boat Loss Rate Above 30%: Household Expenses Caveats: these categories account for very little of the total originations.

Debt Consolidation/Not Available/Business Loans/etc account for more of the loan originations, and their loss rates are between 18-30%.

Is credit score a good proxy for understanding who will default? let’s take a look at the loss rate by the credit score

Wow, it’s almost linear. I would’ve expected there to be more variation. So, what would be interesting to see is if the Post July 2009 loans (for which credit grades are assigned using Prospers own metric) are not as correlated with the credit score. This would indicate that prosper has found other data to properly assess risk. Quick aside: What’s the difference between Borrower APR and Borrower Rate?

## 
## Call:
## lm(formula = BorrowerAPR ~ BorrowerRate, data = subset(df, CreditGrade != 
##     ""))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.014134 -0.004627 -0.002311  0.004586  0.045466 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.504e-03  9.596e-05   67.77   <2e-16 ***
## BorrowerRate 1.033e+00  4.847e-04 2131.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00614 on 28926 degrees of freedom
##   (25 observations deleted due to missingness)
## Multiple R-squared:  0.9937, Adjusted R-squared:  0.9937 
## F-statistic: 4.542e+06 on 1 and 28926 DF,  p-value: < 2.2e-16

Borrower APR is basically a linear function of Borrower Rate. See the R^2 of .99

APR includes fees that Prosper collects, while Rate excludes these fees. From the regression, we can see that the fees Prosper collects increase linearly with respect to increases in Borrower Rates. Tony Soprano probably wasn’t this nice :)

Back to loss rate. How about examining this by: Borrower State/IsBorrowerHomeowner/IncomeRange/DebtToIncomeRatio

Wyoming has an exceedingly low loss rate - we can see that their total originated loan amount is quite low. California, which originates the highest volumn of loans has a loss rate that is slightly less than 25%.

IsBorrowerHomeowner/IncomeRange

Loss rate doesn’t differ too much across IsHomeownerBorrower.

Loss rate falls as income increases - that is to be expected.

Debt to Income Ratio vs Loss Rate

## 
## Call:
## lm(formula = LP_NetPrincipalLoss/LoanOriginalAmount ~ log10(DebtToIncomeRatio + 
##     1), data = subset(df, ClosedDate != ""))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6714 -0.2144 -0.2027  0.2362  0.8114 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.187328   0.002259   82.94   <2e-16 ***
## log10(DebtToIncomeRatio + 1) 0.290199   0.017370   16.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3549 on 50857 degrees of freedom
##   (4230 observations deleted due to missingness)
## Multiple R-squared:  0.005459,   Adjusted R-squared:  0.005439 
## F-statistic: 279.1 on 1 and 50857 DF,  p-value: < 2.2e-16

From the scatter plots, it seems like there is some positive correlation for high enough values of log10(DebtToIncomeRatio+1) and the loss rate. But overall, this explanatory variable does not explain much of the variation in the loss rate. In fact, based on the R^2 value, the DebtToIncomeRatio explains about .5% of the variation in the LP_NetPrincipalLoss/LoanOriginalAmount. That’s extremely low.

What about the Loss Rate across: Loan Size, Term, BorrowerAPR

## [1] -0.1174785

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Pre July 2009, CreditGrade was a direct function of CreditRange. Post July 2009, it is a bit more complex to figure out what constitutes the Prosper Rating.

Loss Rate was examined against many different features.

  • Loss Rate seems to have improved after Prosper introduced their new credit score. The loss per loan and loss rate both fell within and across loan Grades.
  • It is inversely correlated with income range and credit grade.
  • Seems to have little or no relationship to whether the borrower is a home owner, borrower state, and the DebtToIncomeRatio.
  • positively correlated with Term length

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes:

  • BorrowerRate can be approximated as a linear function of BorrowerAPR with an R^2 of .99. This tells us that BorrowerRate grows linearly with as the APR increases.
  • Also, pre July 2009, the CreditGrade has a clear relationship to the Borrower Credit Range. So, before Prosper introduced their Score, they pretty much just used the CreditRange to grade a loan.
  • The longer the term, the larger the loan. 12/36 month loans around the same median loan size. For 60 month loans, loan size is 2.6x larger.

What was the strongest relationship you found?

  • In terms of R^2, the strongest relationship found was between BorrowerRate and BorrowerAPR. But this is a bit tangential to the analysis
  • The CreditGrade’s relationship to the CreditRange says a lot of about the risk assessment before July 2009 - this was also a very strong relationship.
  • In terms of our feature of interest, LossRate, the strongest relationship is between LossRate and the Term of the loan. As the term length increases, the loss rate increases.

Multivariate Plots Section

Let’s dig further into the Loss Rate’s relationship across listing categories weighted by the total volume of loans originated from the listing.

The first bubble chart shows us the Loss Rate across listings with the bubbles sized by the log10 of the total volume of loans originated for that listing. We can see that the largest bubbles typically have loss rates in the 15-30% range. The categories with low loss rates have small bubbles, indicating a very small proportion of originated loans are from listings in these categories.

We break this chart up into 3, 1 for each Term size for a loan. It’s quite clear that regardless of the listing category, the loss rate seems to increase as the loan term increased. We can see that the loss rates for some categories such as Home Improvement have dramatic leaps as we go from 12->36->60 month loans. Debt Consolidation, the largest “Available” category by loan volume originated, has a Loss Rate that goes from 3% (12 month) -> 18% (36 month) -> 26% (60 month). This trajectory, particularily the increase from 36 month to 60 month, is actually better than the trajectories for Home Improvement/Household Expenses, and many other categories.

Let’s consider the loss rate in terms of employment status

Once again, independent of employment status, loss rate tends to be higher with longer term loans.

Let’s consider the loss rate in terms of the term and income range

For high incomes ($100,000+), the loss rate seems to level out at 36 months at ~22.5%. So, in terms of just the loss rate, Prosper could be indifferent between giving a borrower with an income of $100,000 a loan that lasts 36 months/60 months. This information is important, as prosper can adjust it’s sales/fees approach to make the loan product that will be most suitable to ‘high-income borrowers’.

loss rate across term and credit range

Loss rate increases across terms and decreases within a term across credit range.

Loss rate across credit range and income range

Loss rate increases across income ranges, and decreases within an income range across credit ranges.

Finally, let’s try a simple model to predict loss rate

## 
## Call:
## lm(formula = LossRate ~ BorrowerRate + Term + CreditRange + IncomeRange, 
##     data = subset(df, ClosedDate != ""))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9594 -0.2356 -0.1197  0.2184  1.0600 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.2095849  0.0343066   6.109 1.01e-09 ***
## BorrowerRate               0.9874090  0.0214048  46.130  < 2e-16 ***
## Term                       0.0041755  0.0001995  20.930  < 2e-16 ***
## CreditRange360 _ 379      -0.0411836  0.3390156  -0.121  0.90331    
## CreditRange420 _ 439       0.1632677  0.1538580   1.061  0.28862    
## CreditRange440 _ 459       0.1339846  0.0634552   2.111  0.03474 *  
## CreditRange460 _ 479       0.0712817  0.0408267   1.746  0.08082 .  
## CreditRange480 _ 499       0.0210634  0.0344599   0.611  0.54104    
## CreditRange500 _ 519      -0.0700307  0.0326144  -2.147  0.03178 *  
## CreditRange520 _ 539      -0.1380325  0.0306904  -4.498 6.89e-06 ***
## CreditRange540 _ 559      -0.1901940  0.0306766  -6.200 5.69e-10 ***
## CreditRange560 _ 579      -0.2245286  0.0308943  -7.268 3.71e-13 ***
## CreditRange580 _ 599      -0.2571038  0.0311834  -8.245  < 2e-16 ***
## CreditRange600 _ 619      -0.2622299  0.0301587  -8.695  < 2e-16 ***
## CreditRange620 _ 639      -0.2846251  0.0301350  -9.445  < 2e-16 ***
## CreditRange640 _ 659      -0.2899303  0.0299322  -9.686  < 2e-16 ***
## CreditRange660 _ 679      -0.3120850  0.0299634 -10.416  < 2e-16 ***
## CreditRange680 _ 699      -0.3070875  0.0299682 -10.247  < 2e-16 ***
## CreditRange700 _ 719      -0.3074518  0.0300353 -10.236  < 2e-16 ***
## CreditRange720 _ 739      -0.3098907  0.0300924 -10.298  < 2e-16 ***
## CreditRange740 _ 759      -0.3182031  0.0302173 -10.530  < 2e-16 ***
## CreditRange760 _ 779      -0.3122257  0.0303663 -10.282  < 2e-16 ***
## CreditRange780 _ 799      -0.3266017  0.0306216 -10.666  < 2e-16 ***
## CreditRange800 _ 819      -0.3312094  0.0311465 -10.634  < 2e-16 ***
## CreditRange820 _ 839      -0.3428484  0.0320105 -10.711  < 2e-16 ***
## CreditRange840 _ 859      -0.3504755  0.0345278 -10.151  < 2e-16 ***
## CreditRange860 _ 879      -0.3406053  0.0394299  -8.638  < 2e-16 ***
## CreditRange880 _ 899      -0.3691003  0.0779624  -4.734 2.20e-06 ***
## CreditRangeNA _ NA        -0.2814501  0.0324391  -8.676  < 2e-16 ***
## IncomeRange$0              0.0145959  0.0200974   0.726  0.46768    
## IncomeRange$1-24,999      -0.0327570  0.0155077  -2.112  0.03466 *  
## IncomeRange$25,000-49,999 -0.0417113  0.0149126  -2.797  0.00516 ** 
## IncomeRange$50,000-74,999 -0.0716346  0.0149735  -4.784 1.72e-06 ***
## IncomeRange$75,000-99,999 -0.0823963  0.0152675  -5.397 6.81e-08 ***
## IncomeRange$100,000+      -0.0824090  0.0153164  -5.380 7.46e-08 ***
## IncomeRangeNot displayed  -0.0364875  0.0154771  -2.358  0.01840 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3377 on 55053 degrees of freedom
## Multiple R-squared:  0.1159, Adjusted R-squared:  0.1154 
## F-statistic: 206.2 on 35 and 55053 DF,  p-value: < 2.2e-16

Using the BorrwowerRate, Term, CreditRange, and IncomeRange to predict the unweighted loss rate for closed loans, we can see that 11.59% of the variation in the unweighted loss rate is explained by these factors. There is many more columns yet to be explored and subsequent analysis it would worthwhile to consider these factors and see if they can improve the predictive power of this model.

Maps

Will take a detour here and examine maps on Prosper in terms of loan count, originations (current and over time), and loss_rate across states.

Loan counts in various states

As indicated earlier, we see california originates the most loans (by count), followed by texas/new york/florida. Let’s see how the loss_rate and origination volume differs across states.

Again, as indicated previously, we see that wyoming has the lowest loss rate, but perhaps originates relatively low $’s worth of loans. Of the big 4 states (in terms of loan originations); california/texas/new york/florida - it seems that new york has the lowest loss rate. Finally, let’s see how the loan originations volume has changed over the years.

Prosper’s originations started primarily in its home state of California, and then experienced success in Texas, and many states along the East Coast. Since this data is last updated midway through 2014, the intensity of the 2014 heat map is less than that of 2013 – but, i suspect that by the close of 2014, there is probably more penetration by Prosper in the various parts of the US market.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this part of the investigation, I looked at how Loss Rate and Loan Originations vary across Listing Categories, Terms, Employment Statueses, IncomeRange and CreditRange. Also, I considered how Prosper’s loan originations have grown across the continental US by year since inception.

In the multivariate analysis, i could not find features that strengthened each other. Relationships such as Term and Listing Category did not strengthen each other when examined against the Loss Rate. Terms/IncomeRange had some interesting interactions (described in the next section), but again, the features did not seem to strengthen each other.

Were there any interesting or surprising interactions between features?

Yes - For high income borrowers, 36 month and 60 month loss rates were the same. The expectation was that 60 month loss rate is higher than 36 month loss rate regardless of the Income.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a simple model to predict the unweighted LossRate for a loan given the:

  • BorrowerRate
  • Term
  • CreditRange
  • IncomeRange

The model could only explain 11.59% of the variation in the LossRate. There are clearly many latent variables that have not been considered and incorporated into the model. There are some potentially useful features however, like IncomeRange, Term, and BorrowerRate already in the model. So, perhaps this serves as a good foundation to build a more comprehensive linear regression.

Final Plots and Summary

Plot One

Description One

The first plot shows grouped bar chart of the Loss Rate by the Loan Grade (variable created from the CreditGrade and ProsperRating), grouped by whether the loan was originated Pre/Post July 2009. This chart helps us understand how, from an investor standpoint, the change in the Prosper rating system affected loss rates from each Loan Grade.

We can see the expected trends, that loss rate increases as the Loan Grade falls (AA>A>….>HR). This trend holds true for both Pre July 2009 (CreditGrade) loans and Post July 2009 (ProsperRating) loans. Additionally, and more importantly, holding the Loan Grade constant, there is at least a 5% improvement in loss rate across all Loan Grades. With the new rating:

  • AA/A/B/HR improve 10%-15% in loss rate
  • C/D/E grades improve 5%-10% in loss rate

So, this kind of plot shows that loss rates fall for each Loan Grade in the new rating system, and should therefore improve investor confidence in prosper’s underwriting models.

Plot Two

Description Two

Here, we see a grouped bar chart on the Loss Rate by Income Ranges, group by the Term length of the loans. As Income Range increases, the Loss Rate falls for every Term length. Additionally, the variation in the loss rate across Income Ranges increases as Term size increases. Holding Income Range constant, we see that Loss Rate falls as Term length increases. One important caveat, for high-income borrowers (>$100,000) of 36/60 month loans, the loss rate is nearly the same, at around 22%.

Plot Three

Description Three

The plot shows a series of heat maps of the USA, with the variable of interest being Loan Originations by Year.

Excluding 2014 (do not have a full year’s worth of data), prosper loans originations seem to increase across the US states over time. We see that as of 2013, there are no loans in North Dakota, Iowa and Maine. In the past, take 2008 for instance, there were loan originations in these states. In fact, according to this source (http://www.lendingmemo.com/lending-club-and-prosper-states/), 3 states are not open to p2p lending: North Dakota, Iowa and Maine. So, legislation has actually closed these states off from certain lending marketplaces. Aside from this, we can see that California, Texas, New York and Florida are the hotbeds of loan originations. It’s no coincidence that these states also have the highest electoral counts. This is probably more so a testament to the large populations of these states, rather than the level of “p2p lending interest” in these regions (compared to others).

Reflection

This dataset has 81 variables. This makes it very challenging to do EDA on it without having much domain knowledge about this space. It was difficult to get started with so many variables, but the key to moving forward was having a business scenario. Focusing on the loss rate as a metric provided a fixed point and helped guide the analysis. I had a hypothesis that various features would be predictive of the loss rate, and built a linear model with these features, but the outcome showed that the chosen features did not explain much of the variance in the loss rate. Given more time, it would prudent to look deeper into other features of these dataset; especially those describing more detailed info about the borrowers credit history: open lines of credit, inquires, closed lines, etc. However, I was able to garner some insights and generate 3 final charts that illustrate them.